1 Preprocessing

Raw Affymetrix HTA 2.0 array intensity data were analyzed using open-source Bioconductor packages on R. The quiescence and the RAS OIS time series data were normalized all together (2 coenditions, 2 biological replicates per condition, 6 time points per replicates) using the robust multi-array average normalization approach implemented in the oligo package. Internal control probe sets were removed and average expression deciles over time-points were then defined independently for each treatment. Probes whose average expression was lower than the 4th expression decile in both conditions were removed for subsequent analyses. To remove sources of unwanted variation and consider batch effects, data were finally corrected with the sva package. Probes were then annotated using the information provided by Ensembl through biomaRt. Principal component analysis and bi-clustering based on Pearson’s correlation and Ward’s aggregation criterion were used to check for consistency between biological replicates and experimental conditions at each step of the pre-processing.

1.1 Locating data on GEO

The transcriptomic data related to our study are available on Gene Expression Omnibus under the accession GSE112084.

1.3 Getting annotations

The whole probeset on the array are annotated using the infomation (associated gene HGCNC symbols, Entrez IDs, and genomic coordinates on the genome assembly GRCh38.p7 aka hg38),provided on the Ensembl database (version 86 - October 2016) through the package biomaRt.

1.3.2 Bioconductor annotation

# Load the package Bioconductor
library(hta20transcriptcluster.db)

probes <- keys(hta20transcriptcluster.db)

# Mapping between probe names and HGNC gene symbols
symbol <- mapIds(hta20transcriptcluster.db,
                 keys = probes,
                 keytype = "PROBEID",
                 column = "SYMBOL")

# Mapping between probe names and gene Entrez gene IDs
entrez <- mapIds(hta20transcriptcluster.db,
                 keys = probes,
                 keytype = "PROBEID",
                 column = "ENTREZID")

# Mapping between probe names and gene Ensembl gene IDs
ensembl <- mapIds(hta20transcriptcluster.db,
                  keys = probes,
                  keytype = "PROBEID",
                  column = "ENSEMBL")

# Mapping between probe names and chromosome the gene is located in 
chromosome <- mapIds(hta20transcriptcluster.db,
                     keys = probes,
                     keytype = "PROBEID",
                     column = "CHR")

# Mapping between probe names and gene start (bp)
start <- abs(as.numeric(mapIds(hta20transcriptcluster.db,
                    keys = probes,
                    keytype = "PROBEID",
                    column = "CHRLOC")))

# Mapping between probe names and gene end (bp)
stop <- abs(as.numeric(mapIds(hta20transcriptcluster.db,
                   keys = probes,
                   keytype = "PROBEID",
                   column = "CHRLOCEND")))

# Creating a global annotation matrix
annotation_bioconductor <- data.frame(Probe_ID = probes,
                                      Ensembl_ID = ensembl,
                                      Entrez_ID = entrez,
                                      HGNC_Symbol = symbol,
                                      Chromosome = chromosome,
                                      Start = start,
                                      Stop = stop,
                                      stringsAsFactors = FALSE)

# Formating
annotation_bioconductor$Probe_ID <- gsub("hg.1$", "hg", annotation_bioconductor$Probe_ID)

# Exporting
write.table(annotation_bioconductor, "./results/MICROARRAY_ANNOTATION_BIOCONDUCTOR.txt",
            quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")

1.4 Applying robust multi-array normalization

Once the data are loaded, we can run the Robust Multiarray Normalization (RMA). This type of normalization is supported by the package oligo. This procedure, which include a quantile normalization approach, aimed to polish the median expression between arrays. Notice the use of the option target = "core" which set the analysis at the gene level.

1.5 Removing control and lowly expressed probesets

To avoid considering irrelevant probesets in forthcoming analyses, it is important to filter the expression matrix obtained after applying the RMA. There are two kinds of probesets we want to remove :

  • internal Affymetrix control probesets: we can easily find their IDs in the annotation package pd.hta.2.0;
  • lowly expressed probesets: one empirical rule that is usually applied when working with transcriptomic microarray data is to consider that, in any given tissue - cell type - condition, up to 60-80% of the genes are expressed. To get rid of most of the noise, we decided to apply the 60% cutoff.

1.6 Visual inspection of RMA effect

After these pre-processings, to properly evaluate the impact of the RMA, it is important to graphically visualize the data before and after normalization. Several visual exploration should be used :

  • Box plot;
  • Unsupervised clustering;
  • PCA.

1.6.2 Clustering

The biclustering based on sample correlations can be usefull to highlight relationships between sample and potential artifacts, espcially regarding batch effect, which would need to be taken into account in subsequent analyses. To run a clustering and plot a dendrogram, we first need to define :

  • The distance metric we want to use. For gene expression analysis, we usually use 1 - cor(method = "pearson").
  • The aggreagation criteria. Here we will use the Ward’s D2 distance.
Clustering : Effect of RMA normalization

Figure 2: Clustering : Effect of RMA normalization

Overall, what we see here is that either before or after RMA, the structure of the dataset is consistent and separate quite well the two treatments (quiescence time-courses (Q) and RAS time courses (RAS)). The RMA allowed to gain further in consistency, and to regain the expected proximity between the initial time points for the quiescence time-cousrse (Q_0) and the RAS timecourses (RAS_0). Nevetheless, we still see some unintended substructures in the correlation matrix after RMA, with samples clustered together by batch rahter than time-points. We should take these artefacts into account before moving forward and remove the variability that is explained by the day the array were processed.

1.8 Visual inspection of batch effect correction

To properly evaluate the impact of the batch effect correction, it is once again important to graphically visualize the data before and after transformation.

1.8.1 Clustering

Clustering : Effect of batch effect correction

Figure 3: Clustering : Effect of batch effect correction

In this dendrogram, we can clearly see the batch effect has been properly removed. All the array belonging to the first batch (with the suffix _1) are not clustered together any more. Instead, arrays that are the replication of a given time point tend to cluster together and appears less distant.

2 Self organizing maps

Normalized log-scaled and filtered expression values were processed using the unsupervised machine learning method implemented in oposSOM to train a self-organizing map (SOM). This algorithm applies a neural network algorithm to project high dimensional data onto a two-dimensional visualization space. In this application, we used a two-dimensional grid of size 60 x 60 metagenes of rectangular topology.

2.2 Explore SOM outputs

2.2.2 D-cluster

To define modules of co-expressed meta-genes, we used a clustering approach relying on distance matrix and implemented in oposSOM. Briefly, clusters of gene expression were determined based on the patterns of the distance map which visualizes the mean Euclidean distance of each SOM unit to its adjacent neighbors. This clustering algorithm finds the SOM units referring to local maxima of their mean distance with respect to their neighbors. These pixels form halos edging the relevant clusters in the respective distance map and enable robust determination of feature clusters in the SOM.

SOM : D-cluster projection

Figure 7: SOM : D-cluster projection

2.2.4 Functional over-representation

We finally performed a gene set over-representation analysis in each cluster considering the MSigDB hallmark gene sets using a right-tail modified Fisher’s exact test and the hypergeometric distribution to provide p-value.

D cluster Pathways -log10(q)
A myc targets v1 1.838
A glycolysis 1.085
A bile acid metabolism 0.673
A peroxisome 0.655
A unfolded protein response 0.596
B mtorc1 signaling 2.07
B unfolded protein response 1.051
B myc targets v2 0.958
B uv response up 0.803
B tnfa signaling via nfkb 0.606
C tnfa signaling via nfkb 24.358
C kras signaling up 16.727
C inflammatory response 9.006
C il2 stat5 signaling 6.627
C hypoxia 6.542
D oxidative phosphorylation 2.59
D tnfa signaling via nfkb 1.553
D estrogen response early 1.131
D il2 stat5 signaling 1.001
D dna repair 0.79
E tnfa signaling via nfkb 4.969
E complement 2.454
E epithelial mesenchymal transition 2.131
E interferon gamma response 1.472
E inflammatory response 1.386
F epithelial mesenchymal transition 2.203
F hypoxia 1.71
F il6 jak stat3 signaling 1.586
F inflammatory response 1.297
F mitotic spindle 1.088
G epithelial mesenchymal transition 17.227
G myogenesis 8.532
G mtorc1 signaling 4.283
G angiogenesis 4.078
G coagulation 3.859
H apoptosis 0.799
H tgf beta signaling 0.661
H mitotic spindle 0.626
H epithelial mesenchymal transition 0.623
H unfolded protein response 0.376
I unfolded protein response 1.581
I uv response up 1.175
I heme metabolism 0.872
I androgen response 0.816
I mtorc1 signaling 0.52
J g2m checkpoint 67.642
J e2f targets 65.605
J mitotic spindle 29.165
J spermatogenesis 9.941
J dna repair 4.809

3 Information-theory based analyses

To evaluate transcriptome diversity and specialization, we used the approaches derived from the information theory. Briefly, considering the relative expression of the \(i^{th}\) gene in the \(j^{th}\) tissue, we defined:

  • the transcriptome diversity as:

\[H_j = \sum_{i=1}^g p_{ij} * log_2 (p_{ij})\]

  • the transcriptome specialization as:

\[\sigma_j = \sum_{i=1}^g p_{ij} * S_i \]

  • with the gene specificity for the \(i^{th}\) gene:

\[S_i = \frac{1}{t} * \left( \sum_{j=1}^t \left( \frac {p_{ij}}{p_i} \right) * log_2\left(\frac {p_{ij}}{p_i}\right)\right) \]

  • and the average relative expression for the \(i^{th}\) gene across samples:

\[p_i = \frac{1}{t}*\sum_{j=1}^t p_{ij}\]

Information theory : Entropy vs specificity

Figure 9: Information theory : Entropy vs specificity

4 Differential expression analysis

Normalized log-scaled and filtered expression data related to the quiescence and the RAS OIS time series were further considered for differential analysis with limma. To define a RAS OIS specific transcriptomic signature, we proceeded in three steps, each relying on linear mixed model cubic B-splines, as nonlinear response patterns are commonly encountered in time course biological data.

  • First, we defined probes responding over time to ER:RAS induction.
  • Second, we considered all together the quiescence and the RAS OIS time series, as well as the interaction between time and treatment, and defined probes responding to one or the other treatment over time, as well as probes responding differently between the two treatments at any time point.
  • We finally defined the set of probes responding consistently to both treatment and time and removed these probes from the global set of probes responding to the ER:RAS induction defined at the first step

Moderated F-statistics that combine the empirical Bayes moderated t-statistics for all contrasts into an overall test of significance for each probe were used to assess the significance of the observed expression changes. At any step of this workflow, p-values were corrected for multiple testing using the FDR approach for a stringent significance level of 0.005.

For validation purposes, we also wanted to compress the RAS OIS time-series and achieve a volcano plot representation. To do so, we’ve computed the maximal absolute log2 fold change in expression in the RAS OIS time series considering T0 as the reference and selected up and down regulated probes using an absolute log2 fold change cutoff at 1.2 and a corrected p-value cutoff of 0.005. We then build a scatter-plot plotting the log10 significance versus log2 fold-change on the y and x axes, respectively. Probes responding consistently to both ER:RAS induction and quiescence were finally over-plotted.

4.2 Fit the model to the data

# Load the library limma
library(limma)

# Fit the models
fit_limma_RAS  <- lmFit(data_rma_no_batch[,-c(1:12)], design_limma_RAS)
fit_limma_time <- lmFit(data_rma, design_limma_time)
fit_limma_full <- lmFit(data_rma, design_limma_full)

##--> Genes DE through time during RAS timecourse
fit_limma_RAS_2   <- eBayes(fit_limma_RAS)
sig_limma_RAS     <- topTable(fit_limma_RAS_2, coef=c(2:6), adjust="BH", number = Inf, p.value = .005)
sig_limma_RAS_all <- topTable(fit_limma_RAS_2, coef=c(2:6), adjust="BH", number = Inf)

##--> Genes DE between treatment at any timepoint
contrast_Q_vs_RAS <- makeContrasts(design_splines_full1.design_full.TreatmentQ -
                                     design_splines_full1.design_full.TreatmentRAS,
                                   design_splines_full2.design_full.TreatmentQ -
                                     design_splines_full2.design_full.TreatmentRAS,
                                   design_splines_full3.design_full.TreatmentQ -
                                     design_splines_full3.design_full.TreatmentRAS,
                                   levels = design_limma_full)
fit_limma_Q_vs_RAS     <- contrasts.fit(fit_limma_full, contrast_Q_vs_RAS)
fit_limma_Q_vs_RAS_2   <- eBayes(fit_limma_Q_vs_RAS, robust = TRUE)
sig_limma_Q_vs_RAS_all <- topTable(fit_limma_Q_vs_RAS_2, adjust="BH", number = Inf)

##--> Genes DE through time, whatever the treatment
fit_limma_time_2 <- eBayes(fit_limma_time, robust = TRUE)
sig_limma_time <- topTable(fit_limma_time_2, coef=c(3:5), adjust="BH", number = Inf)


##--> Genes that are not DE between treatment
not_DE_Q_vs_RAS <- setdiff(rownames(data_rma),
                           rownames(sig_limma_Q_vs_RAS_all[which(sig_limma_Q_vs_RAS_all$P.Value<=.2),]))

##--> Gene that are consistenly DE through time among cell types 
DE_Q_vs_RAS_concoord <-  sig_limma_time[match(not_DE_Q_vs_RAS, row.names(sig_limma_time), nomatch = 0),]
DE_Q_vs_RAS_concoord  <- DE_Q_vs_RAS_concoord[order(DE_Q_vs_RAS_concoord$adj.P.Val),]
DE_Q_vs_RAS_concoord  <- DE_Q_vs_RAS_concoord[which(DE_Q_vs_RAS_concoord$adj.P.Val < .005), ]

4.4 Volcano plot

##--> Create a function to compute the max absolute logFC
get_max_absolut_FC <- function (x) {
  fc <- x[1:6] - x[1]
  max_fc <- max(fc)
  min_fc <- min(fc)
  if (abs(max_fc) > abs(min_fc)) {abs_fc <- max_fc}
  if (abs(max_fc) < abs(min_fc)) {abs_fc <- min_fc}
  return(abs_fc)
}

##--> Select expression matrix for DE whatever treatment
average_transcriptomic_data_RAS <- data.frame(average_transcriptomic_data[,c(7:12)])

##--> Compute the max absolute logFC for each treatment
average_transcriptomic_data_RAS$Max_LFC_RAS <- as.numeric(apply(average_transcriptomic_data_RAS[,c(1:6)],
                                                                1, get_max_absolut_FC))

##--> Add the pval and the qval
average_transcriptomic_data_RAS$Pval <- sig_limma_RAS_all[match(row.names(average_transcriptomic_data_RAS),
                                         row.names(sig_limma_RAS_all)),8]
average_transcriptomic_data_RAS$Qval <- sig_limma_RAS_all[match(row.names(average_transcriptomic_data_RAS),
                                         row.names(sig_limma_RAS_all)),9]

##--> Annotation
average_transcriptomic_data_RAS$Gene_Symbol <- annotation_bioconductor[match(
  row.names(average_transcriptomic_data_RAS),
  annotation_bioconductor[,1]), ]$HGNC_Symbol

##--> Format the data
volcano_data <- average_transcriptomic_data_RAS
volcano_data$Change <- "STABLE"
volcano_data$Change[which(volcano_data$Qval < .005 
                    & volcano_data$Max_LFC_RAS 
                    > .48)] <- "UP"
volcano_data$Change[which(volcano_data$Qval < .005 
                    & volcano_data$Max_LFC_RAS 
                    < - .48)] <- "DOWN"

volcano_data <- volcano_data[order(volcano_data$Qval),]

xlim <- c(-6, -4)
ylim <- c(6, NA)

##--> Volcano plot
ggplot(data = volcano_data, aes(y = -log10(Qval), x = Max_LFC_RAS)) +
  geom_point(aes(color = Change)) +
  geom_text_repel(data = subset(volcano_data, Change == "DOWN")[c(1:20),],
                    aes(label = Gene_Symbol),
                    segment.color = "black", size = 3,  force = 1, nudge_x = -0.25) +
  geom_text_repel(data = subset(volcano_data, Change == "UP")[c(1:20),],
                    aes(label = Gene_Symbol),
                    segment.color = "black", size = 3,  force = 1, nudge_x = 0.25) +
  geom_hline(yintercept = -log10(.005)) +
  geom_vline(xintercept = c(-.48,.48)) +
  scale_x_continuous(limits = c(-7, 7), name = "Max Abs Log2 FC") +
  scale_y_continuous(name = "-log10 Qval", limits = c(0,7)) +
  geom_point(data = volcano_data[which(volcano_data$Change == "CONSISTENT"),],
             aes(y = -log10(Qval), x = Max_LFC_RAS), color = "grey50") +
  scale_color_manual(values=c("chartreuse3", "grey", "firebrick1")) +
  theme_classic() + theme(legend.position="none")
Differential analysis for RAS time-courses: volcano plot

Figure 10: Differential analysis for RAS time-courses: volcano plot

4.5 Plot the expression trajectories for some probes

5 Clustering with WGCNA

Probes constitutive of the RAS OIS specific transcriptomic signature were clustered using the weighted gene correlated network analysis approach implemented in the WGCNA R package. Standard WGCNA parameters were used for the analysis, with the exceptions of soft-thresholding power which was defined using methods described by (Langfelder et al., 2008) and set at 18. The 7 co-expressed probe clusters identified were further functionally characterized using gene set over-representation tests. The same approach as previously described for the SOM-defined clusters was used.

5.1 Inputs and parameters

5.1.5 Defining gene modules

#---> Define the cutoff for gene module size
min_module_size <- 200 

#---> Cut the clustering tree
dynamic_mods <- cutreeDynamic(dendro = gene_tree,
                            distM = diss_tom,
                            deepSplit = 3,
                            pamRespectsDendro = FALSE,
                            minClusterSize = min_module_size)

dynamic_colors <- labels2colors(dynamic_mods)

#---> Clustering eigengenes
library(ggdendro)
ME_list <- moduleEigengenes(data_WGCNA, color = dynamic_colors, softPower = soft_power)
ME <- ME_list$eigengenes
ME_diss <- 1 - cor(ME)
ME_tree <- as.dendrogram(hclust(as.dist(ME_diss), method= "average"))

P1 <- ggdendrogram(ME_tree) + 
  ggtitle("Clustering of module eigengenes")

#---> Heat map for this set of clusters
data_transcriptomic_WGCNA_module_RAS <- t(data_WGCNA)[order(ME_list$validColors),]
annot_WGCNA_module_RAS <- data.frame(Module = ME_list$validColors[(order(ME_list$validColors))])
row.names(annot_WGCNA_module_RAS) <- row.names(data_transcriptomic_WGCNA_module_RAS)

color <- as.character(unique(annot_WGCNA_module_RAS[,1]))
names(color) <- as.character(unique(annot_WGCNA_module_RAS[,1]))
color <- list(Module = color)

P2 <- pheatmap(cbind(data_transcriptomic_WGCNA_module_RAS[,c(7:12)],
                     data_transcriptomic_WGCNA_module_RAS[,c(1:6)]),
               cluster_rows = FALSE,
               cluster_cols = FALSE,
               scale = "row",
               show_rownames = FALSE,
               color = colorRampPalette(c("blue", "black", "gold"))(100),
               annotation_row = annot_WGCNA_module_RAS,
               annotation_colors = color, silent = TRUE,
               main = "Expression in module eigengenes")

#---> Combine the two plots
do.call(grid.arrange,list(P1,P2[[4]]))
WGCNA - Initial set of eigengenes

Figure 19: WGCNA - Initial set of eigengenes

5.1.6 Refining gene modules

WGCNA - Comparison of eigengenes before / after merging

Figure 20: WGCNA - Comparison of eigengenes before / after merging

#---> Clustering eigengenes
module_colors <- merged_colors
color_order <- c("grey", standardColors(50))
module_labels <- match(module_colors, color_order )-1
ME <- merged_ME
ME_diss_merged <- 1 - cor(ME)
ME_tree <- hclust(as.dist(ME_diss_merged), method= "average")

P1 <- ggdendrogram(ME_tree) + 
  ggtitle("Clustering of module eigengenes")

#---> Heat map for the final set of clusters
data_transcriptomic_WGCNA_module_RAS <- t(data_WGCNA)[order(module_colors),]
data_transcriptomic_WGCNA_module_RAS <- data.frame(data_transcriptomic_WGCNA_module_RAS,
                                                  Modules = as.factor(as.character(
                                                    module_colors[order(module_colors)])))

order <- as.factor(c("magenta", "purple", "pink", "yellow","blue",
                     "greenyellow", "black", "green", "brown"))

data_transcriptomic_WGCNA_module_RAS$Modules <-as.character(
                    data_transcriptomic_WGCNA_module_RAS$Modules)
data_transcriptomic_WGCNA_module_RAS[which(data_transcriptomic_WGCNA_module_RAS$
                                             Modules =="purple"),]$Modules <- "magenta"
data_transcriptomic_WGCNA_module_RAS <- data_transcriptomic_WGCNA_module_RAS[
                    order(match(data_transcriptomic_WGCNA_module_RAS$Modules,order)),]
data_transcriptomic_WGCNA_module_RAS <- data_transcriptomic_WGCNA_module_RAS[
                    -which(data_transcriptomic_WGCNA_module_RAS$Modules == "greenyellow"),]

color <- as.character(unique(data_transcriptomic_WGCNA_module_RAS$Modules))
names(color) <- as.character(unique(data_transcriptomic_WGCNA_module_RAS$Modules))
color <- list(Module = color)
annot <- data.frame(Module = data_transcriptomic_WGCNA_module_RAS$Modules)
row.names(annot) <- row.names(data_transcriptomic_WGCNA_module_RAS)

P2 <- pheatmap(cbind(data_transcriptomic_WGCNA_module_RAS[,c(7:12)],
                     data_transcriptomic_WGCNA_module_RAS[,c(1:6)]),
               cluster_rows = FALSE,
               cluster_cols = FALSE,
               scale = "row",
               show_rownames = FALSE,
               color = colorRampPalette(c("blue", "black", "gold"))(100),
               annotation_row = annot,
               annotation_colors = color, silent = TRUE,
               main = "Expression in module eigengenes")

#---> Combine the two plots
do.call(grid.arrange,list(P1,P2[[4]]))
WGCNA - Final set of eigengenes

Figure 21: WGCNA - Final set of eigengenes

5.3 Functional over-representation analysis

#---> Match probes in the data set to the probe IDs in the annotation file
probes <- row.names(data_rma_no_batch)
probes2annot <- match(probes, annotation_bioconductor[,1])

#---> Get the corresponding Entrez_ID
all_IDs <- annotation_bioconductor$Entrez_ID[probes2annot]

#---> Construct the input data set
fun_enrich_WGCNA <- data.frame(t(data_WGCNA))
fun_enrich_WGCNA$Entrez <- as.character(annotation_bioconductor[match(row.names(fun_enrich_WGCNA),
                                                           annotation_bioconductor[,1]),]$Entrez_ID)
fun_enrich_WGCNA$Symbol <- as.character(annotation_bioconductor[match(row.names(fun_enrich_WGCNA),
                                                            annotation_bioconductor[,1]),]$HGNC_Symbol)
fun_enrich_WGCNA <- merge(fun_enrich_WGCNA, annot, by = "row.names")
fun_enrich_WGCNA <- fun_enrich_WGCNA[-which(is.na(fun_enrich_WGCNA$Entrez)),c(14:16)]
fun_enrich_WGCNA <- split(fun_enrich_WGCNA[,1], list(fun_enrich_WGCNA$Module))
fun_enrich_WGCNA <- fun_enrich_WGCNA[order(match(names(fun_enrich_WGCNA),unique(annot$Module)))]

#--->  Over-representation analysis (MSigDB Hallmark)
library(ReactomePA)
library(clusterProfiler)

# Load the gmt file
GMT <- read.gmt("./tools/h.all.v6.0.entrez.gmt")

# Run the over-representation analysis 
fun_enrich_res <- data.frame(compareCluster(fun_enrich_WGCNA, fun='enricher',
                                            TERM2GENE=GMT, pvalueCutoff=1, qvalueCutoff = 1,
                                            minGSSize = 10))
# Filter the data
fun_enrich_res_2 <- data.frame()
for (i in unique(fun_enrich_res$Cluster)) {
  x <- fun_enrich_res[which(fun_enrich_res$Cluster == i),]
  x[which(x$qvalue > .2),]$qvalue <- NA
  x <- x[order(x$qvalue),]
  fun_enrich_res_2 <- rbind(fun_enrich_res_2, x)
}

fun_enrich_res_2 <- na.omit(fun_enrich_res_2)

# Clustering
library(reshape)
fun_enrich_res_tmp <- fun_enrich_res_2[,c("ID", "qvalue", "Cluster")]
fun_enrich_res_tmp <- cast(fun_enrich_res_tmp, Cluster~ID, value = "qvalue")
fun_enrich_res_tmp <- data.matrix(fun_enrich_res_tmp)
fun_enrich_res_tmp[is.na(fun_enrich_res_tmp)] <- 0
row.names(fun_enrich_res_tmp) <- fun_enrich_res_tmp[,1]
fun_enrich_res_tmp <- fun_enrich_res_tmp[,-1]
order_fun_enrich <- hclust(as.dist(1-cor(fun_enrich_res_tmp)), method = "ward.D2")$order

fun_enrich_res_2$Description <- factor(fun_enrich_res_2$Description,
                                       levels =colnames(fun_enrich_res_tmp)[order_fun_enrich] )
fun_enrich_res_2$Cluster <- factor(fun_enrich_res_2$Cluster, levels=unique(annot$Module))
fun_enrich_res_2$Percent <- unlist(lapply(strsplit(fun_enrich_res_2$GeneRatio, "/"), function(x){as.numeric(x[1])/as.numeric(x[2])}*100))

# Plot
ggplot(fun_enrich_res_2, aes(y = Description, x = Cluster, fill = -log10(pvalue), size = Percent)) +
  geom_point(shape = 21, stroke = .5) +
  scale_fill_gradient(low = "firebrick1", high = "chartreuse3") +
  theme_classic() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1), axis.text.y = element_text(size = 8))
WGCNA - Functional over-representation analysis (MSigDB hallmark)

Figure 23: WGCNA - Functional over-representation analysis (MSigDB hallmark)

6 Export results

6.2 Export results as a hg19 .bed file

7 Session Info

## R version 3.4.4 (2018-03-15)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.5
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8
## 
## attached base packages:
##  [1] splines   stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] R.utils_2.6.0                   R.oo_1.22.0                    
##  [3] R.methodsS3_1.7.1               rtracklayer_1.38.3             
##  [5] GenomicRanges_1.30.3            GenomeInfoDb_1.14.0            
##  [7] reshape_0.8.7                   GSEABase_1.40.1                
##  [9] graph_1.56.0                    annotate_1.56.2                
## [11] XML_3.98-1.11                   clusterProfiler_3.6.0          
## [13] ReactomePA_1.22.0               DOSE_3.4.0                     
## [15] ggdendro_0.1-20                 WGCNA_1.63                     
## [17] fastcluster_1.1.25              dynamicTreeCut_1.63-1          
## [19] RColorBrewer_1.1-2              bindrcpp_0.2.2                 
## [21] kableExtra_0.9.0                dplyr_0.7.6                    
## [23] data.table_1.11.4               ggrepel_0.8.0                  
## [25] limma_3.34.9                    sva_3.26.0                     
## [27] BiocParallel_1.12.0             genefilter_1.60.0              
## [29] mgcv_1.8-24                     nlme_3.1-137                   
## [31] pheatmap_1.0.10                 gridExtra_2.3                  
## [33] ggplot2_2.2.1                   reshape2_1.4.3                 
## [35] hta20transcriptcluster.db_8.7.0 org.Hs.eg.db_3.5.0             
## [37] AnnotationDbi_1.40.0            biomaRt_2.36.1                 
## [39] pd.hta.2.0_3.12.2               DBI_1.0.0                      
## [41] RSQLite_2.1.1                   oligo_1.42.0                   
## [43] Biostrings_2.46.0               XVector_0.18.0                 
## [45] IRanges_2.12.0                  S4Vectors_0.16.0               
## [47] Biobase_2.38.0                  oligoClasses_1.40.0            
## [49] BiocGenerics_0.24.0             BiocStyle_2.6.1                
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.1.2            fastmatch_1.1-0           
##   [3] Hmisc_4.1-1                plyr_1.8.4                
##   [5] igraph_1.2.1               lazyeval_0.2.1            
##   [7] fastICA_1.2-1              robust_0.4-18             
##   [9] digest_0.6.15              GOSemSim_2.4.1            
##  [11] foreach_1.4.4              BiocInstaller_1.28.0      
##  [13] htmltools_0.3.6            GO.db_3.5.0               
##  [15] arules_1.6-1               magrittr_1.5              
##  [17] checkmate_1.8.5            memoise_1.1.0             
##  [19] fit.models_0.5-14          cluster_2.0.7-1           
##  [21] doParallel_1.0.11          readr_1.1.1               
##  [23] matrixStats_0.53.1         prettyunits_1.0.2         
##  [25] colorspace_1.3-2           rappdirs_0.3.1            
##  [27] rrcov_1.4-4                blob_1.1.1                
##  [29] pixmap_0.4-11              rvest_0.3.2               
##  [31] xfun_0.2                   crayon_1.3.4              
##  [33] RCurl_1.95-4.10            bindr_0.1.1               
##  [35] impute_1.52.0              survival_2.42-4           
##  [37] iterators_1.0.9            ape_5.1                   
##  [39] glue_1.2.0                 oposSOM_1.16.0            
##  [41] gtable_0.2.0               zlibbioc_1.24.0           
##  [43] DelayedArray_0.4.1         graphite_1.24.1           
##  [45] DEoptimR_1.0-8             scales_0.5.0              
##  [47] mvtnorm_1.0-8              som_0.3-5.1               
##  [49] Rcpp_0.12.17               viridisLite_0.3.0         
##  [51] xtable_1.8-2               progress_1.2.0            
##  [53] htmlTable_1.12             reactome.db_1.62.0        
##  [55] foreign_0.8-70             bit_1.1-14                
##  [57] preprocessCore_1.40.0      Formula_1.2-3             
##  [59] tsne_0.1-3                 htmlwidgets_1.2           
##  [61] httr_1.3.1                 fgsea_1.4.1               
##  [63] acepack_1.4.1              ff_2.2-14                 
##  [65] pkgconfig_2.0.1            nnet_7.3-12               
##  [67] tidyselect_0.2.4           labeling_0.3              
##  [69] rlang_0.2.1                munsell_0.5.0             
##  [71] tools_3.4.4                fdrtool_1.2.15            
##  [73] evaluate_0.10.1            stringr_1.3.1             
##  [75] yaml_2.1.19                knitr_1.20                
##  [77] bit64_0.9-7                robustbase_0.93-1         
##  [79] purrr_0.2.5                DO.db_2.9                 
##  [81] xml2_1.2.0                 compiler_3.4.4            
##  [83] rstudioapi_0.7             curl_3.2                  
##  [85] affyio_1.48.0              tibble_1.4.2              
##  [87] statmod_1.4.30             pcaPP_1.9-73              
##  [89] stringi_1.2.3              highr_0.7                 
##  [91] lattice_0.20-35            Matrix_1.2-14             
##  [93] pillar_1.2.3               bitops_1.0-6              
##  [95] qvalue_2.10.0              R6_2.2.2                  
##  [97] latticeExtra_0.6-28        bookdown_0.7              
##  [99] affxparser_1.50.0          codetools_0.2-15          
## [101] MASS_7.3-50                assertthat_0.2.0          
## [103] SummarizedExperiment_1.8.1 rprojroot_1.3-2           
## [105] GenomicAlignments_1.14.2   Rsamtools_1.30.0          
## [107] GenomeInfoDbData_1.0.0     hms_0.4.2                 
## [109] grid_3.4.4                 rpart_4.1-13              
## [111] tidyr_0.8.1                rvcheck_0.1.0             
## [113] rmarkdown_1.10             scatterplot3d_0.3-41      
## [115] base64enc_0.1-3